# Multimodal Models

| Model | License | Developer | Description | Tags | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| PP-Chart2Table | Apache-2.0 | PaddlePaddle | Multimodal model from the PaddlePaddle team focused on Chinese and English chart parsing; efficiently converts charts into data tables. | Image-to-Text, Multilingual | 1,392 | 0 |
| Gemma 3 4B IT QAT GGUF | — | unsloth | Gemma 3 is a lightweight, state-of-the-art open model family from Google, built on the same research and technology behind the Gemini models. This model is multimodal, taking text and image inputs and generating text outputs. | Text-to-Image, English | 2,629 | 2 |
| LLM-jp CLIP ViT Base Patch16 | Apache-2.0 | llm-jp | Japanese CLIP model trained with the OpenCLIP framework, supporting zero-shot image classification. | Text-to-Image, Japanese | 40 | 1 |
| PaliGemma LongPrompt v1 Safetensors | GPL-3.0 | mnemic | Experimental vision model that combines keyword tags with long text descriptions to generate image prompts. | Image-to-Text, Transformers | 38 | 1 |
| PaliGemma 3B Mix 448 FT TableDetection | — | ucsahin | Multimodal table-detection model fine-tuned from google/paligemma-3b-mix-448, specialized in locating table regions in images. | Image-to-Text, Transformers | 19 | 4 |
| PaliGemma Rich Captions | Apache-2.0 | gokaygokay | Image-captioning model fine-tuned from PaliGemma-3B on the DocCI dataset; generates detailed descriptions of 200–350 characters with reduced hallucination. | Image-to-Text, Transformers, English | 66 | 9 |
| Compare2Score | MIT | q-future | Image quality assessment model that assigns a quality score to an input image. | Image Enhancement, Transformers | 391 | 4 |
| ViT-Medium Patch16 CLIP 224 (TinyCLIP, YFCC-15M) | MIT | timm | CLIP model with a ViT backbone for zero-shot image classification. | Image Classification | 144 | 0 |
| SigLIP Large Patch16-384 | Apache-2.0 | google | SigLIP is a multimodal model pretrained on the WebLI dataset with an improved sigmoid loss, suited for zero-shot image classification and image-text retrieval. | Image-to-Text, Transformers | 245.21k | 6 |
| SigLIP Large Patch16-256 | Apache-2.0 | google | Vision-language model pretrained on WebLI with an improved sigmoid loss that boosts image-text matching performance. | Image-to-Text, Transformers | 24.13k | 12 |
| SigLIP Base Patch16-512 | Apache-2.0 | google | Vision-language model pretrained on WebLI with an improved sigmoid loss; strong at image classification and image-text retrieval. | Text-to-Image, Transformers | 237.79k | 24 |
| Chinese-CLIP ViT-Large Patch14 | — | Xenova | Chinese CLIP model with a Vision Transformer backbone, supporting cross-modal understanding and retrieval between Chinese text and images. | Text-to-Image, Transformers | 14 | 0 |
| SigLIP Base Patch16-224 | Apache-2.0 | google | Vision-language model pretrained on WebLI with an improved sigmoid loss, optimized for image-text matching. | Image-to-Text, Transformers | 250.28k | 43 |
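
Several of the CLIP- and SigLIP-style checkpoints above are typically used for zero-shot image classification. A minimal sketch with the `transformers` zero-shot image classification pipeline, assuming the `google/siglip-base-patch16-224` checkpoint listed above (the image path and candidate labels are placeholders):

```python
from PIL import Image
from transformers import pipeline

# Zero-shot image classification with a SigLIP checkpoint.
# The model id corresponds to the "SigLIP Base Patch16-224" entry above;
# the image path and labels are illustrative placeholders.
classifier = pipeline(
    "zero-shot-image-classification",
    model="google/siglip-base-patch16-224",
)

image = Image.open("example.jpg")
candidate_labels = ["a bar chart", "a line chart", "a photo of a cat"]

# The pipeline returns one {"label": ..., "score": ...} dict per candidate label.
for result in classifier(image, candidate_labels=candidate_labels):
    print(f"{result['label']}: {result['score']:.3f}")
```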
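
The "improved sigmoid loss" mentioned in the SigLIP descriptions means each image-text pair is scored independently with a sigmoid rather than normalized with a softmax across a batch, which is what makes these checkpoints convenient for image-text retrieval. A rough sketch of the lower-level usage, again assuming `google/siglip-base-patch16-224` and placeholder inputs:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")            # placeholder image
texts = ["a bar chart", "a photo of a cat"]  # placeholder candidate texts

inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Each image-text pair gets an independent probability via a sigmoid,
# rather than a softmax over all candidates as in the original CLIP.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)
```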
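
The PaliGemma-based entries (table detection, rich captions) are image-to-text models fine-tuned from a common base. A sketch of how a PaliGemma checkpoint is loaded and prompted with `transformers`, assuming the gated `google/paligemma-3b-mix-448` base model referenced in the table-detection entry; the image path and prompt are placeholders, and the fine-tuned derivatives are loaded the same way with their own repo ids:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Base checkpoint referenced by the table-detection entry above; the repo is
# gated, so its license must be accepted on the model page before downloading.
model_id = "google/paligemma-3b-mix-448"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("page.png")   # placeholder input image
prompt = "caption en"            # the mix checkpoints expect short task prompts

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the echoed prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(generated[0][prompt_len:], skip_special_tokens=True))
```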